R Beginners Course

Data processing

Objective and contents

Objective and contents

Objective: to gain insights related to the organisation and pre-processing of gridded data as well as to work with time series data.

  • Data organisation
  • Accessing and importing particular files
  • Pre-processing data
  • Aggregating data
  • Generation of time series from gridded datasets
  • Extra Considerations

Data organisation

Data organisation

Data organisation is the practice of classifying and organising data to make it more usable.

  • Consistent and clear names
  • File titles as short as possible
  • Consistent version control
  • Creation of data dictionaries

Data organisation

It is preferable to save files (corresponding to a time series) with the date (and/or time of day) in the name because:

  • The files will be accessed in the correct order
  • It is easier to apply general codes
  • We are less likely to make errors

Data organisation

The alphabetical organisation of files should correspond to the chronological order of the files.

  • Therefore, the year should come before the month and then the month before the date

Data organisation

File name consistency (length and style).

  • We suggest that each file for a certain dataset should have the same base and then a numeric identifier (for date, iteration number, scenario, etc.)

  • To ensure the same length of file names, months and days from 1 to 9 should be written from 01 to 09

Exporting tables (csv, txt, zoo, etc.).

  • Same data always in the same location (i.e. in the same row or column)

  • Make sure that values with no data are not recorded as zeroes!

Accessing and importing particular files

Accessing and importing particular files

Sometimes we only want to use particular files for a certain analysis. Therefore, it is useful to learn how to subset data from a directory.

We have already seen that we can list all of the files within a folder:

ssebop_files <- list.files("C:/User/SSEBop", full.names = TRUE)

Selecting Particular Files in R

To select a subset of these files we can use the grep function. This functions looks for a predefined pattern in the files and returns the position of each element.

grep("2011", basename(ssebop_files))
##  [1] 13 14 15 16 17 18 19 20 21 22 23 24

If the base name of the files does not contain the pattern, we could use the entire path and the parameter value as follows:

grep("2011", ssebop_files, value = TRUE)
##  [1] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201101.tif"
##  [2] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201102.tif"
##  [3] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201103.tif"
##  [4] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201104.tif"
##  [5] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201105.tif"
##  [6] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201106.tif"
##  [7] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201107.tif"
##  [8] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201108.tif"
##  [9] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201109.tif"
## [10] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201110.tif"
## [11] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201111.tif"
## [12] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201112.tif"

Selecting Particular Files in R

In the case that we want to use the vector of positions, we have to subset the vector of files as follows:

pos         <- grep("2011", basename(ssebop_files))
ssebop_2011 <- ssebop_files[pos]
head(ssebop_2011)
## [1] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201101.tif"
## [2] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201102.tif"
## [3] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201103.tif"
## [4] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201104.tif"
## [5] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201105.tif"
## [6] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201106.tif"

This could be done in one line:

ssebop_2011 <- ssebop_files[grep("2011", basename(ssebop_files))]

Selecting Particular Files in R

What if we only want the files for every March?

ssebop_march <- ssebop_files[grep("03.tif", basename(ssebop_files))]
head(ssebop_march)
## [1] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201003.tif"
## [2] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201103.tif"
## [3] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201203.tif"
## [4] "../Data/L4_Data_Processing_Data/SSEBop//SSEBop_201303.tif"

This works because file names are logical and consistent!

Pre-processing data

Pre-processing data

We typically need to pre-process raw data so that it can be used in the way that we require.

  • To pre-process data, we need to know how the raw data is stored
  • This means we need to find out some or all of the following:
    • Whether a scaling (and/or offset) factor needs to be applied
    • What are the units
    • How NA values have been stored
  • Projection system
  • In this example, we will look at the CHIRPSv2 precipitation product

Pre-processing CHIRPSv2

Load the CHIRPSv2 precipitation monthly raw data for January, 2000.

chirps_path <- "C:/User/CHIRPS_raw/chirps-v2.0.2000.01.tif"

Now, we can import the raster using the terra package.

library(terra)
chirps_raw <- rast(chirps_path)

Pre-processing CHIRPSv2

plot(chirps_raw)

Pre-processing CHIRPSv2

CHIRPSv2 only provides in-land precipitation. NA values (i.e., where no estimate is made) are stored in the raw data as equal to -9999.

  • So how can we create a dataset that we can then use everywhere?
chirps_raw[chirps_raw == -9999] <- NA
plot(chirps_raw)

Pre-processing CHIRPSv2

Now that the data has been pre-processed, we can save it as a new raster.

writeRaster(chirps.raw, "C:/User/Result/CHIRPSv2.0_200001.tif", overwrite = TRUE)

Aggregating data

Aggregating data

Sometimes, we have to aggregate data for analysing different temporal scales. One product can be retrieved at daily scale but monthly or annual estimates may be required. In this example, we want to derive an annual precipitation raster from the monthly CHIRPS rasters for the years 2000-2003.

  • In this example, we will deal with raster files that are already pre-processed (i.e., the -9999 values have already been converted to NAs)
chirps_files <- list.files("C:/User/CHIRPS_monthly", full.names = TRUE)

Aggregating data

head(chirps_files)
## [1] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2000-01.tif"
## [2] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2000-02.tif"
## [3] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2000-03.tif"
## [4] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2000-04.tif"
## [5] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2000-05.tif"
## [6] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2000-06.tif"
tail(chirps_files)
## [1] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2003-07.tif"
## [2] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2003-08.tif"
## [3] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2003-09.tif"
## [4] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2003-10.tif"
## [5] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2003-11.tif"
## [6] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2003-12.tif"

Aggregating data

Now, we want to generate annual raster files by adding the 12 monthly files for each year.

  • We need first to extract the years from the filenames
years <- substr(basename(chirps_files), 15, 18)
print(years)
##  [1] "2000" "2000" "2000" "2000" "2000" "2000" "2000" "2000" "2000" "2000"
## [11] "2000" "2000" "2001" "2001" "2001" "2001" "2001" "2001" "2001" "2001"
## [21] "2001" "2001" "2001" "2001" "2002" "2002" "2002" "2002" "2002" "2002"
## [31] "2002" "2002" "2002" "2002" "2002" "2002" "2003" "2003" "2003" "2003"
## [41] "2003" "2003" "2003" "2003" "2003" "2003" "2003" "2003"
  • Now, we will use the unique function to get the different years included in our vector
unique_years <- unique(years)
print(unique_years)
## [1] "2000" "2001" "2002" "2003"

Aggregating data

How can we write a for loop to generate one annual raster file at a time?

output_dir <- "C:/User/CHIRPSv2_annual"

for(i in 1:length(unique_years)){
  # Here we extract the paths of the year i
  positions <- grep(unique_years[i], basename(chirps_files))
  files     <- chirps_files[positions]
  
  # Then we stack the monthly files of the year i
  files <- rast(files)
  
  # Now we have to add the monthly precipitation to get the annual precipitation
  annual <- sum(files)

  # Finally we can export the file. However, we first have to create a different (and identifiable)
  # name for each year. So we will include the year into the new file name.
  name          <- paste0("CHIRPS_annual_", unique_years[i])
  names(annual) <- name
  name          <- paste0(name, ".tif")
  writeRaster(annual, file.path(output_dir, name), overwrite = TRUE)
}

Aggregating data

If you want to look at how it is working step by step we suggest setting i = 1 (or any other number, and then running each step individually and printing/plotting the results after each step.

i <- 1
positions <- grep(unique_years[i], basename(chirps_files))
print(positions)
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12
files <- chirps_files[positions]
head(files)
## [1] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2000-01.tif"
## [2] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2000-02.tif"
## [3] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2000-03.tif"
## [4] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2000-04.tif"
## [5] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2000-05.tif"
## [6] "../Data/L4_Data_Processing_Data/CHIRPS_monthly/chirps_monthly2000-06.tif"

Aggregating data

files <- rast(files)
print(files)
## class       : SpatRaster 
## dimensions  : 2000, 7200, 12  (nrow, ncol, nlyr)
## resolution  : 0.05, 0.05  (x, y)
## extent      : -180, 180, -50, 50  (xmin, xmax, ymin, ymax)
## coord. ref. : lon/lat WGS 84 (EPSG:4326) 
## sources     : chirps_monthly2000-01.tif  
##               chirps_monthly2000-02.tif  
##               chirps_monthly2000-03.tif  
##               ... and 9 more source(s)
## names       : chirp~00-01, chirp~00-02, chirp~00-03, chirp~00-04, chirp~00-05, chirp~00-06, ... 
## min values  :           0,           0,           0,           0,           0,           0, ... 
## max values  :    1251.456,    1467.558,    1222.492,    1224.301,    1322.237,    1584.747, ...
annual <- sum(files)

Aggregating data

plot(annual)

Aggregating data

name          <- paste0("CHIRPS_annual_", unique_years[i])
names(annual) <- name
name <- paste0(name, ".tif")
print(name)
## [1] "CHIRPS_annual_2000.tif"
writeRaster(annual, file.path(output_dir, name), overwrite = TRUE)

Let’s have a look at what is now saved in the CHIRPS_annual folder.

Aggregating data

Now, in R:

annual_files <- list.files(output_dir, full.names = TRUE)
print(annual_files)
## [1] "../Data/L4_Data_Processing_Data/CHIRPS_annual/CHIRPS_annual_2000.tif"
## [2] "../Data/L4_Data_Processing_Data/CHIRPS_annual/CHIRPS_annual_2001.tif"
## [3] "../Data/L4_Data_Processing_Data/CHIRPS_annual/CHIRPS_annual_2002.tif"
## [4] "../Data/L4_Data_Processing_Data/CHIRPS_annual/CHIRPS_annual_2003.tif"

Let’s plot the annual rasters.

annual_files <- rast(annual_files)

Aggregating data

plot(annual_files)

The %in% command

We could also have used the %in% command in the previous example. This command returns the positions of specific values in a vector.

a <- c(10, 10, 20, 20, 30, 30, 40, 40)
a %in% 20
## [1] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE

If we use the %in% command in combination with the which command:

which(a %in% 20)
## [1] 3 4

The Previous Example with the %in% Command

for(i in 1:length(unique_years)){
  positions <- which(years %in% unique_years[i])
  files <- chirps_files[positions]
  
  # Then we stack the monthly files of the year i
  files <- rast(files)
  
  # Now we have to add the monthly precipitation to get the annual precipitation
  annual <- sum(files)
  
  # Finally we can export the file. However, we first have to create a different (and identifiable)
  # name for each year. So we will include the year into the new file name.
    name          <- paste0("CHIRPS_annual_", unique_years[i])
  names(annual) <- name
  name          <- paste0(name, ".tif")
  writeRaster(annual, file.path(output_dir, name), overwrite = TRUE)
}

Generation of time series from gridded datasets

Processing Time Series Data

In many instances, we will be working with long time series and will want to extract data over a particular period. We have already looked at how to extract data from particular columns (see example from L2).

Here, we will look at subsetting data according to period and calculating monthly statistics. For this purpose, we will use the hydroTSM package, which is designed to efficiently handle time series. Even though hydroTSM was designed for the field of hydrology, it is an efficient package for managing time series, regardless of the scientific field in which we are working.

install.packages("hydroTSM")
library(hydroTSM)

The zoo Package

The zoo package is useful when we have to work with time series. While the read.table or read.csv commands return the dates in the respective column, the zoo object includes the dates in the index of the object.

Please note that the hydroTSM package loads the zoo package

Reading the Data

We will first read the csv file:

temp_data <- read.csv("C:/User/Daily_Temperature.csv")
head(temp_data)
##         Date        tmax         tmin
## 1 1950-01-01 16.87414245  2.267171503
## 2 1950-01-02 14.12995701  7.472945504
## 3 1950-01-03 12.87256241 -0.322433109
## 4 1950-01-04  12.7848237 -3.291659449
## 5 1950-01-05 15.15733243 -3.763536855
## 6 1950-01-06 20.72999248 -5.898691615
dim(temp_data)
## [1] 25567     3
summary(temp_data)
##      Date               tmax               tmin          
##  Length:25567       Length:25567       Length:25567      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character

NA values

The NA values have been saved as this text string in the csv file provided (#N/A). We will reread the csv file, specifying the na.strings parameter to interpret the NAs correctly.

temp_data <- read.csv("C:/User/Daily_Temperature.csv", na.strings = "#N/A")
head(temp_data)
##         Date     tmax       tmin
## 1 1950-01-01 16.87414  2.2671715
## 2 1950-01-02 14.12996  7.4729455
## 3 1950-01-03 12.87256 -0.3224331
## 4 1950-01-04 12.78482 -3.2916594
## 5 1950-01-05 15.15733 -3.7635369
## 6 1950-01-06 20.72999 -5.8986916

NA values

tail(temp_data)
##             Date     tmax      tmin
## 25562 2019-12-26 14.67704  7.608191
## 25563 2019-12-27       NA        NA
## 25564 2019-12-28       NA        NA
## 25565 2019-12-29       NA        NA
## 25566 2019-12-30 15.46794  7.934365
## 25567 2019-12-31 21.43409 10.980799
summary(temp_data)
##      Date                tmax             tmin       
##  Length:25567       Min.   : 7.407   Min.   :-7.701  
##  Class :character   1st Qu.:21.273   1st Qu.: 7.584  
##  Mode  :character   Median :25.868   Median :12.026  
##                     Mean   :25.874   Mean   :11.614  
##                     3rd Qu.:30.341   3rd Qu.:15.742  
##                     Max.   :45.716   Max.   :27.670  
##                     NA's   :1881     NA's   :1843

Conversion to a Zoo Object

We will use the zoo function to convert the object temp.data to a zoo object.

zoo(x, order.by)

For our example:

temp_data <- zoo(temp_data[, c(2, 3)], order.by = as.Date(temp_data[, 1]))

Conversion to a Zoo Object

The dates do not correspond to a column of data! There are only two columns (tmax and tmin).

head(temp_data)
##                tmax       tmin
## 1950-01-01 16.87414  2.2671715
## 1950-01-02 14.12996  7.4729455
## 1950-01-03 12.87256 -0.3224331
## 1950-01-04 12.78482 -3.2916594
## 1950-01-05 15.15733 -3.7635369
## 1950-01-06 20.72999 -5.8986916
summary(temp_data)
##      Index                 tmax             tmin       
##  Min.   :1950-01-01   Min.   : 7.407   Min.   :-7.701  
##  1st Qu.:1967-07-02   1st Qu.:21.273   1st Qu.: 7.584  
##  Median :1984-12-31   Median :25.868   Median :12.026  
##  Mean   :1984-12-31   Mean   :25.874   Mean   :11.614  
##  3rd Qu.:2002-07-01   3rd Qu.:30.341   3rd Qu.:15.742  
##  Max.   :2019-12-31   Max.   :45.716   Max.   :27.670  
##                       NA's   :1881     NA's   :1843
dim(temp_data)
## [1] 25567     2

Conversion to a Zoo Object

To access the dates, we use the index function.

dates <- index(temp_data)
head(dates)
## [1] "1950-01-01" "1950-01-02" "1950-01-03" "1950-01-04" "1950-01-05"
## [6] "1950-01-06"
tail(dates)
## [1] "2019-12-26" "2019-12-27" "2019-12-28" "2019-12-29" "2019-12-30"
## [6] "2019-12-31"
index(temp_data)[3:5]
## [1] "1950-01-03" "1950-01-04" "1950-01-05"

Extracting Data for a Time Window

We can use the window function to extract data over a particular time period.

temp_1990s <- window(temp_data, start = "1990-01-01", end = "1999-12-31")
summary(temp_1990s)
##      Index                 tmax            tmin       
##  Min.   :1990-01-01   Min.   :10.59   Min.   :-1.821  
##  1st Qu.:1992-07-01   1st Qu.:21.99   1st Qu.: 9.007  
##  Median :1994-12-31   Median :26.34   Median :13.071  
##  Mean   :1994-12-31   Mean   :26.41   Mean   :12.708  
##  3rd Qu.:1997-07-01   3rd Qu.:30.62   3rd Qu.:16.543  
##  Max.   :1999-12-31   Max.   :44.71   Max.   :26.289  
##                       NA's   :195     NA's   :180

Monthly Statistics

We can use the daily2monthly function to calculate statistics for each month.

daily2monthly(x, FUN, na.rm = TRUE/FALSE)

x: is the zoo object FUN: function to be applied na.rm: logical. Shoul the NAs be removed?

Note that the hydroTSM package includes conversions to monthly, seasonal and annual values.

Monthly Statistics

For our example, let’s calculate the mean tmax and tmin for each month of the 1990s.

monthly_temp1990s <- daily2monthly(temp_1990s, mean, na.rm = TRUE)
head(monthly_temp1990s)
##                tmax      tmin
## 1990-01-01 22.13839  6.021942
## 1990-02-01 20.89031  6.032163
## 1990-03-01 22.10266 10.087943
## 1990-04-01 23.71689 13.046162
## 1990-05-01 25.62137 13.678022
## 1990-06-01 30.10750 16.801449
summary(monthly_temp1990s)
##      Index                 tmax            tmin       
##  Min.   :1990-01-01   Min.   :18.73   Min.   : 5.058  
##  1st Qu.:1992-06-23   1st Qu.:22.24   1st Qu.: 9.054  
##  Median :1994-12-16   Median :26.21   Median :12.645  
##  Mean   :1994-12-16   Mean   :26.35   Mean   :12.649  
##  3rd Qu.:1997-06-08   3rd Qu.:30.21   3rd Qu.:16.650  
##  Max.   :1999-12-01   Max.   :35.33   Max.   :20.311  
##                       NA's   :1       NA's   :1

Visualisation of time series

To visualise the time series stored in a zoo object as a plot:

plot(monthly_temp1990s)

Extra Considerations

Temporary files

Now that you have already processed some large data on your computer, let’s have a look at how much space has been used up in writing temporary files.

  • Go to the following file path in File Manager: C:

  • If you’re analysing large amounts of data, the temporary files will start taking up a lot of room on your computer. There are a few options:

    1. Periodically delete temporary files from your computer

    2. Tell R to write the raster temporary files to another location (e.g., an external drive). You will still need to delete them later. Note that this is only for the raster files.

Temporary files

To change the directory where the temporary files are stored you can use the rasterOptions function.

rasterOptions(tmpdir = "file path where temporary raster files will be stored")

Additionally, you could add code to delete temporary files. This can be done with the file.remove function.

Parallel computation in R

Each core of a CPU can perform operations separately from the others. If no special instructions are given, R will only use one computer core. If you need to process very large amounts of data, please look at the following links: